Michael Clark
We seek orthogonal factors that minimize the reconstruction error.
\[\mathscr{L} = \lVert X - ZW^T\rVert^2\]
\(X\) is the \(N \times D\) matrix of \(N\) observations on \(D\) variables, \(Z\) is the \(N \times L\) matrix of component scores, where \(L < D\), and \(W\) is the \(D \times L\) matrix of weights, typically referred to as the factor loading matrix.
Each component, or factor, accounts for successively less of the variance in the original data. With all components,
\[X = ZW^T\]
The MNIST data contains 28 by 28 pixel images of digits 0-9. If we unroll the data to 784 columns, each row represents a single digit. We can see in the following how well we can reconstruct a digit via PCA. Even with only two components we can get a sense of what we’re looking at. With all components the reconstruction is perfect.
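The reconstruction idea is easy to sketch with base R's prcomp. The data below are simulated rather than MNIST, so the particulars are just for illustration:

```r
# Reconstruct data from the first L components: X_hat = Z W^T (+ column means)
set.seed(1)
X = matrix(rnorm(100 * 5), ncol = 5)
p = prcomp(X)                                   # centers the data by default
L = 2
X_hat  = p$x[, 1:L] %*% t(p$rotation[, 1:L])    # partial reconstruction
X_hat  = sweep(X_hat, 2, p$center, "+")         # add the column means back
X_full = sweep(p$x %*% t(p$rotation), 2, p$center, "+")
max(abs(X - X_full))   # essentially zero: all components reconstruct perfectly
mean((X - X_hat)^2)    # nonzero error with only two components
```

With all 5 components the rotation matrix is orthogonal, so the reconstruction is exact; with 2 we only approximate the data.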
Let’s see an example with more digestible results. The following data are 16 multiple-choice ability items taken from the Synthetic Aperture Personality Assessment (SAPA), a web-based personality assessment project. There are 1525 total subjects and four categories of questions.
library(psych)
cog_ability = psych::iqitems

Let’s do a PCA. We’ll look at four components, though remember that PCA technically returns as many components as there are variables; you’re just seeing part of the results. Note that PCA almost universally requires you to standardize the data, so that each variable has the same variance. Otherwise you’ll just get components reflecting the relative variances. In what follows the PCA is done on the correlation matrix, which amounts to the same thing.
pc = principal(cog_ability, 4)
pc

Principal Components Analysis
Call: principal(r = cog_ability, nfactors = 4)
Standardized loadings (pattern matrix) based upon correlation matrix
RC1 RC2 RC3 RC4 h2 u2 com
reason.4 0.60 0.01 0.14 0.09 0.38 0.62 1.2
reason.16 0.67 0.07 0.11 0.04 0.47 0.53 1.1
reason.17 0.50 0.26 -0.08 0.11 0.33 0.67 1.7
reason.19 0.64 0.02 0.14 0.04 0.43 0.57 1.1
letter.7 0.71 0.05 0.02 -0.01 0.51 0.49 1.0
letter.33 0.35 -0.06 0.06 0.58 0.46 0.54 1.7
letter.34 0.64 0.04 0.06 -0.03 0.42 0.58 1.0
letter.58 0.43 -0.17 0.07 0.42 0.40 0.60 2.4
matrix.45 0.64 0.11 0.08 0.10 0.43 0.57 1.1
matrix.46 -0.11 0.26 0.08 0.61 0.46 0.54 1.5
matrix.47 -0.02 0.17 0.06 0.66 0.46 0.54 1.2
matrix.55 0.32 0.32 0.11 0.11 0.23 0.77 2.5
rotate.3 0.09 0.72 0.27 0.12 0.61 0.39 1.4
rotate.4 0.09 0.81 -0.12 0.10 0.70 0.30 1.1
rotate.6 0.17 -0.06 0.84 0.06 0.74 0.26 1.1
rotate.8 0.14 0.22 0.77 0.14 0.68 0.32 1.3
RC1 RC2 RC3 RC4
SS loadings 3.28 1.55 1.49 1.40
Proportion Var 0.20 0.10 0.09 0.09
Cumulative Var 0.20 0.30 0.39 0.48
Proportion Explained 0.42 0.20 0.19 0.18
Cumulative Proportion 0.42 0.63 0.82 1.00
Mean item complexity = 1.4
Test of the hypothesis that 4 components are sufficient.
The root mean square of the residuals (RMSR) is 0.07
with the empirical chi square 2033.18 with prob < 0
Fit based upon off diagonal values = 0.87
First focus on the portion of the output labeled SS loadings. The first line is the sum of the squared loadings1 for each component (since we are using a correlation matrix as the basis, summing across all 16 possible components would equal 16). The Proportion Var tells us how much of the overall variance each component accounts for out of all the variables (e.g. 3.28 / 16 = 0.20). The Cumulative Var tells us that the 4 components together account for about 48% of the variance. Proportion Explained and Cumulative Proportion are the same thing, just based on the 4 retained components rather than all 16 variables (so the cumulative explained proportion reaches 1). We can see that the second component accounts for less variance than the first, and this would continue with additional components, each accounting for a decreasing amount of variance.
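As the footnote notes, the SS loadings are the eigenvalues of the correlation matrix. A quick sketch with simulated data (not the SAPA items) shows the accounting:

```r
# Eigenvalues of a correlation matrix sum to the number of variables;
# dividing by that number gives Proportion Var, cumulating gives Cumulative Var
set.seed(123)
d  = matrix(rnorm(200 * 4), ncol = 4)
ev = eigen(cor(d))$values
sum(ev)               # = 4, the number of variables
ev / ncol(d)          # Proportion Var for each component
cumsum(ev) / ncol(d)  # Cumulative Var; reaches 1 with all components
```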
Some explanation of the other parts of the output:
- h2: the amount of variance in the item/variable explained by the (retained) components. It is the sum of the squared loadings, a.k.a. the communality. For example, the first reasoning item shares only 38% of its variance with these components.
- u2: 1 - h2, the residual or unique variance.
- com: a measure of item complexity. A value of 1 would be seen for an item that loads on only one component and zero on the others (a.k.a. perfect simple structure).

Loadings, also referred to as the pattern matrix, in this scenario represent the estimated correlation of an item with its component, and provide the key way in which we interpret the factors. As an example, we can reproduce the loadings by correlating the observed variables with the estimated component scores2.
cor(cog_ability, pc$scores, use = 'pair') %>% round(2)
          RC1   RC2   RC3   RC4
reason.4 0.60 0.01 0.14 0.09
reason.16 0.67 0.07 0.11 0.04
reason.17 0.50 0.26 -0.08 0.11
reason.19 0.64 0.02 0.14 0.04
letter.7 0.71 0.05 0.02 -0.01
letter.33 0.35 -0.06 0.06 0.58
letter.34 0.64 0.04 0.06 -0.03
letter.58 0.43 -0.17 0.07 0.42
matrix.45 0.64 0.11 0.08 0.10
matrix.46 -0.11 0.26 0.08 0.61
matrix.47 -0.02 0.17 0.06 0.66
matrix.55 0.32 0.32 0.11 0.11
rotate.3 0.09 0.72 0.27 0.12
rotate.4 0.09 0.81 -0.12 0.10
rotate.6 0.17 -0.06 0.84 0.06
rotate.8 0.14 0.21 0.77 0.14
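The h2 and u2 quantities can likewise be computed directly from a loading matrix. A sketch with simulated data (not the SAPA items), building the loadings from the eigendecomposition:

```r
# Loadings = eigenvectors scaled by sqrt(eigenvalues); h2 is the row sum
# of squared loadings for the retained components, u2 = 1 - h2
set.seed(123)
d = matrix(rnorm(200 * 4), ncol = 4)
e = eigen(cor(d))
loadings = e$vectors %*% diag(sqrt(e$values))
h2 = rowSums(loadings[, 1:2]^2)   # communalities with 2 retained components
u2 = 1 - h2                       # uniquenesses
rowSums(loadings^2)               # all 1s: with all components, everything is explained
```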
It can be difficult to sort it out just by looking at the values, so we’ll look at it visually. In the following plot, stronger loadings are indicated by blue, and we can see the different variables associated with different components.
Interpretation is the fun, but commonly difficult, part. In this case, the variables don’t appear to be grouping like we’d expect, except for some of the reasoning scores3. It’s worth mentioning the naming fallacy at this point. Just because we associate a factor with some concept, doesn’t make it so. In addition, for PCA the goal is not interpretation, but to reduce the data while retaining as much of the variance as possible.
Why do it?
Issues
Other stuff
Now let’s examine what is sometimes called common factor analysis, sometimes exploratory factor analysis in the social sciences, even ‘classical’ or ‘traditional’ factor analysis, but typically just factor analysis (FA) everywhere else. While we can use both PCA and FA for similar reasons (dimension reduction), and they have similar interpretations (in terms of loadings), there are some underlying subtleties that distinguish the two. Noting these distinctions in any detail will require some matrix notation, but readers not keen on such presentation can focus on the images and concluding points.
First let’s revisit PCA. Conceptually, we approximate the data as the product of component scores and weights,

\[X \approx ZW^T\]

and we attempt to reproduce the correlation matrix as the product of the loading matrix \(L\) with itself:

\[R = LL'\]

The component scores are given by

\[C = XW\]

In other words, each component score \(C\), i.e. the score for a particular observation with regard to the component, is a weighted combination of the \(p\) observed variables \(X\), with the weights determined by the loading matrix \(W\). The scores by themselves say nothing about the correlations between the variables.
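We can verify the scores-as-weighted-combinations idea with prcomp on simulated data:

```r
# Component scores C = XW: multiplying the (standardized) data by the
# rotation/weight matrix reproduces prcomp's scores
set.seed(1)
X = scale(matrix(rnorm(100 * 3), ncol = 3))  # standardized data
p = prcomp(X)
C = X %*% p$rotation
max(abs(C - p$x))   # essentially zero
```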
Things are different with factor analysis. Now the causal flow is in the other direction, originating with the latent variables.
\[X \approx FW\]
Each observed variable \(x\) is a function of the latent variables that it is associated with. In addition, we also take into account the uniquenesses \(\Psi\), or that part which the factors do not explain.
And to reproduce the correlation matrix:
\[R \approx LL'\]
\[R = LL' + \Psi\]
So if we just use the loadings from the FA, we cannot reproduce the correlation matrix exactly; we need to add the uniquenesses as well.
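A sketch of this reconstruction using base R's factanal on simulated one-factor data (the setup is hypothetical, not the SAPA data):

```r
# R ~ LL' + Psi: the loadings alone underestimate the diagonal;
# adding the uniquenesses recovers the correlation matrix
set.seed(42)
f = rnorm(200)                                          # one latent factor
X = sapply(1:4, function(i) .7 * f + rnorm(200, sd = .5))
fit   = factanal(X, factors = 1)
L     = fit$loadings
R_hat = L %*% t(L) + diag(fit$uniquenesses)
round(cor(X) - R_hat, 2)   # residuals should be near zero
```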
What this amounts to conceptually are a few key ideas: the latent variables are taken as the source of the correlations among the observed variables, and each variable retains some unique variance the factors do not explain. Let’s fit a factor analysis to the same data.
fa_model = fa(cog_ability, 4, rotate='oblimin')
fa_model

Factor Analysis using method = minres
Call: fa(r = cog_ability, nfactors = 4, rotate = "oblimin")
Standardized loadings (pattern matrix) based upon correlation matrix
MR1 MR3 MR2 MR4 h2 u2 com
reason.4 0.53 0.04 -0.04 0.04 0.30 0.70 1.0
reason.16 0.61 0.03 0.03 -0.04 0.40 0.60 1.0
reason.17 0.43 -0.05 0.11 0.08 0.22 0.78 1.2
reason.19 0.57 0.04 -0.03 0.00 0.34 0.66 1.0
letter.7 0.67 -0.03 0.03 -0.09 0.43 0.57 1.0
letter.33 0.33 0.02 -0.04 0.27 0.21 0.79 2.0
letter.34 0.57 0.00 0.00 -0.05 0.32 0.68 1.0
letter.58 0.38 0.03 -0.07 0.14 0.19 0.81 1.4
matrix.45 0.58 0.00 0.00 0.07 0.36 0.64 1.0
matrix.46 -0.05 0.01 0.04 0.49 0.25 0.75 1.0
matrix.47 0.03 0.03 0.05 0.37 0.17 0.83 1.1
matrix.55 0.27 0.04 0.09 0.15 0.15 0.85 1.8
rotate.3 0.00 0.25 0.46 0.08 0.36 0.64 1.6
rotate.4 0.02 -0.04 0.80 0.00 0.64 0.36 1.0
rotate.6 0.05 0.65 -0.11 -0.03 0.43 0.57 1.1
rotate.8 -0.01 0.68 0.08 0.04 0.49 0.51 1.0
MR1 MR3 MR2 MR4
SS loadings 2.68 1.01 0.95 0.60
Proportion Var 0.17 0.06 0.06 0.04
Cumulative Var 0.17 0.23 0.29 0.33
Proportion Explained 0.51 0.19 0.18 0.11
Cumulative Proportion 0.51 0.70 0.89 1.00
With factor correlations of
MR1 MR3 MR2 MR4
MR1 1.00 0.40 0.19 0.16
MR3 0.40 1.00 0.12 0.37
MR2 0.19 0.12 1.00 0.38
MR4 0.16 0.37 0.38 1.00
Mean item complexity = 1.2
Test of the hypothesis that 4 factors are sufficient.
The degrees of freedom for the null model are 120 and the objective function was 2.6 with Chi Square of 3944.59
The degrees of freedom for the model are 62 and the objective function was 0.06
The root mean square of the residuals (RMSR) is 0.02
The df corrected root mean square of the residuals is 0.02
The harmonic number of observations is 1523 with the empirical chi square 100.71 with prob < 0.0014
The total number of observations was 1525 with Likelihood Chi Square = 91.43 with prob < 0.0089
Tucker Lewis Index of factoring reliability = 0.985
RMSEA index = 0.018 and the 90 % confidence intervals are 0.009 0.025
BIC = -363.01
Fit based upon off diagonal values = 0.99
Measures of factor score adequacy
MR1 MR3 MR2 MR4
Correlation of (regression) scores with factors 0.90 0.82 0.84 0.69
Multiple R square of scores with factors 0.80 0.67 0.70 0.48
Minimum correlation of possible factor scores 0.61 0.33 0.40 -0.04
modelCode = "
verbal =~ reason.4 + reason.16 + reason.17 + reason.19
spatial =~ rotate.3 + rotate.4 + rotate.6 + rotate.8
# not necessary
# verbal ~~ spatial
# if you don't want them correlated
# verbal ~~ 0*spatial
"
famod = cfa(modelCode, data=cog_ability)
summary(famod, standardized=T, rsq=T, nd=2)
pars = parameterEstimates(famod, standardized=T)

\[X \sim \mathcal{N}(ZW^T + \mu, \Psi)\]

\(\mu\) is the vector of intercepts, and \(\Psi\) is a \(D \times D\) covariance matrix.
For probabilistic PCA, \(\Psi\) is \(\sigma^2 I\). Standard PCA is recovered in the limit \(\sigma \rightarrow 0\).
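A quick simulation shows the implied marginal covariance under the probabilistic PCA model (all parameter values here are made up):

```r
# Under x = Wz + e with z ~ N(0, I) and e ~ N(0, sigma^2 I),
# the marginal covariance of x is WW' + sigma^2 I
set.seed(7)
W = matrix(c(1, .5, -.5, 1), 2, 2)    # D x L weight matrix (hypothetical)
sigma = .3
z = matrix(rnorm(5000 * 2), ncol = 2)
X = z %*% t(W) + matrix(rnorm(5000 * 2, sd = sigma), ncol = 2)
round(cov(X) - (W %*% t(W) + sigma^2 * diag(2)), 2)   # differences near zero
```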
Commonly our data are counts or otherwise compositional. In this case we can use techniques better suited to such data.
Non-negative matrix factorization (NMF) is similar to PCA, but with the constraint that the scores and weights be non-negative. It is (in the usual implementation) equivalent to probabilistic latent semantic analysis.
Latent Dirichlet Allocation (LDA) takes a different approach to count data. The classic example is text analysis, where LDA is applied to word frequencies across documents. Consider a document-term matrix, where each row regards a document and each column a word or term. LDA looks for latent topics, which are probability distributions over the terms. There is a probability distribution over topics as well as over terms, and given both, we will see some occurrence of each term in each document. Each document can be seen as a mix of topics.
In the factor analysis sense, each variable (term) has some probability of occurring for a given topic, i.e. factor. One can think of the term/variable probabilities for a specific topic much like we did loadings in the factor analysis setting. Every variable will have some non-zero probability of occurrence for a given factor, but often these are essentially zero. Factor scores are the document topic probabilities.
Given the probabilities of topics, and of terms within topics, there is a multinomial draw of counts for an observation. With this probabilistic result, there isn’t a ‘reconstruction’ connotation with LDA. However, we can simulate a multinomial draw based on the total counts we see. For example, let’s assume 5 terms seen a total of 20 times, with given probabilities.
t(rmultinom(1, size = 20, prob = 5:1))  # first term is most likely, fifth term least likely
     [,1] [,2] [,3] [,4] [,5]
[1,]    4    7    5    2    2
Those are the counts we see for a given observation.
LDA wouldn’t be the best tool if your goal is prediction, and you don’t care much about interpretation.
LDA is basically NMF with Dirichlet priors on the topics.
Take a look at the following data. It regards the waiting time between eruptions and the duration of the eruption (both in minutes) for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.
In some contexts, possibly way too many, some assume the latent variables are discrete in nature rather than continuous. In this case, the dimension reduction occurs along the rows rather than the columns. We’ll talk about two types.
A basic approach for categorical latent variable analysis from a model-based perspective could be seen as follows. Typical methods assume an underlying normal distribution, such that each observation’s likelihood is a weighted sum of its likelihood under each proposed cluster, given that cluster’s estimated parameters. For example, with two clusters, each with an estimated mean and variance, we get the likelihood of a particular observation under each cluster, weighted by its estimated probability of membership in that cluster.
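The weighted-likelihood idea in miniature (all parameter values here are hypothetical):

```r
# Likelihood of a single observation under a two-cluster normal mixture:
# each cluster's density, weighted by the membership probability
x     = 3.2
p_k   = c(0.4, 0.6)   # cluster membership probabilities
mu    = c(2.0, 4.5)   # cluster means
sigma = c(0.5, 1.0)   # cluster standard deviations
lik = sum(p_k * dnorm(x, mu, sigma))
lik
```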
As an example, we’ll use mclust on the old faithful data, positing 2 clusters.
library(mclust)
mod = Mclust(faithful, G = 2)
summary(mod)

----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust VVE (ellipsoidal, equal orientation) model with 2 components:
log.likelihood n df BIC ICL
-1132.187 272 10 -2320.433 -2320.763
Clustering table:
1 2
175 97
We can see that things break out more or less as we might expect, though there is a bit more uncertainty with observations in the middle.
The nice thing about model based approaches is that if we want to compare different solutions we can use the same tools we do in regression settings. Here, mclust does a search over many models (not just differences in cluster sizes, but also things like equal variances, shape, etc.) and selects the ‘best’ model according to BIC.
mod = Mclust(faithful, verbose = FALSE)

Mixture models are often used in the regression setting, similar to the latent class setting we’ve been describing. One approach treats the outcome of interest as having latent categories and models it as such. In a standard regression setting you would then get two sets of output, one for each class, along with estimated probabilities of class membership for each observation.
library(flexmix)
mod = flexmix(eruptions~waiting, data=faithful, k = 2)
summary(mod)
Call:
flexmix(formula = eruptions ~ waiting, data = faithful, k = 2)
prior size post>0 ratio
Comp.1 0.498 148 272 0.544
Comp.2 0.502 124 272 0.456
'log Lik.' -191.5743 (df=7)
AIC: 397.1485 BIC: 422.3891
We can see from the summary that the observations split roughly evenly across the two classes. We also get the estimated coefficients and standard deviations for each class, and can inspect the respective probabilities of each observation for each class. Note that the class labels are completely arbitrary.
parameters(mod)  # coefficients and std dev (sigma) for each group
                      Comp.1      Comp.2
coef.(Intercept) -1.73204354 -2.06336751
coef.waiting 0.07822496 0.07374856
sigma 0.34866389 0.39692190
| Comp.1 | Comp.2 |
|---|---|
| 0.06 | 0.94 |
| 0.141 | 0.859 |
| 0.117 | 0.883 |
| 0.07 | 0.93 |
| 0.464 | 0.536 |
| 0.903 | 0.097 |
| 0.382 | 0.618 |
| 0.003 | 0.997 |
| 0.483 | 0.517 |
| 0.243 | 0.757 |
The key idea is as before: the response comes from two distributions, e.g. two normal distributions, each with its own mean (modeled by covariates) and variance. This is very commonly applied in cases where there are many zeros in the data, leading to zero-inflated, hurdle, and similar models, where the zeros are modeled by a logistic regression and the counts by a Poisson or negative binomial, but the approach is much more general.
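A sketch of the zero-inflation idea with simulated data (the probabilities and rate are made up):

```r
# Zeros come from two sources: a binary (logistic-like) process producing
# structural zeros, and the Poisson itself occasionally producing zero
set.seed(1)
n = 10000
structural_zero = rbinom(n, 1, prob = .3)
y = ifelse(structural_zero == 1, 0, rpois(n, lambda = 2))
mean(y == 0)   # well above dpois(0, 2), the Poisson-only zero rate
dpois(0, 2)
```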
K-means is hands-down the most commonly used clustering method. Conceptually it’s fairly straightforward: find \(k\) clusters that minimize the variance of members about their cluster mean. As such it’s easy to implement in standard data settings.
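Applying k-means to the same Old Faithful data used earlier:

```r
# k-means with k = 2 on standardized waiting/eruption times; nstart
# reruns from multiple random starts to avoid poor local solutions
set.seed(1)
km = kmeans(scale(faithful), centers = 2, nstart = 25)
table(km$cluster)   # cluster sizes, comparable to the mclust solution
km$centers          # standardized cluster means
```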
Hierarchical methods can be thought of as producing a nested series of cluster solutions, either agglomerative (starting with each observation as its own cluster and merging) or divisive (starting with one cluster and splitting).
Non-model based methods provide no non-arbitrary way of selecting the number of clusters within a particular method, or of comparing solutions across methods. Many criteria have been offered, and in typical settings they will rarely agree.
No one will recommend that you take continuous outcomes or predictors and categorize them. It is sometimes done as a computational convenience, but if you’re interested in interpretability and inference, it’s almost always the wrong choice.
Now ask: why would you do it with latent variables? Unless there is very strong theory or an actual physical aspect involved, it is difficult to justify.
They are the eigenvalues of the correlation matrix. In addition, they are the diagonal of the crossproduct of the loading matrix.↩
These results have been rotated, something practically no one using PCA actually does.↩
Results like this are more the rule than the exception in my experience. Not enough initial development is typically done with scales, and when other people use them in other settings, it often ends with disappointment. Just think, many psychological scales are developed on freshman psychology students. How well do you think that will generalize?↩